2 Sentinel scheduling: a model for compiler-controlled speculative execution
Speculative execution is an important source of parallelism for VLIW and superscalar
processors. A serious challenge with compiler-controlled speculative execution is to 
efficiently handle exceptions for speculative instructions. In this article, a set of 
architectural features and compile-time scheduling support collectively referred to as 
sentinel scheduling is introduced. Sentinel scheduling provides an effective framework for 
both compiler-controlled speculative executi ... 

Keywords: VLIW processor, exception detection, exception recovery, instruction
scheduling, instruction-level parallelism, speculative execution, superscalar processor 
Trends in high-performance computing are making it necessary for long-running
applications to tolerate hardware faults. The most commonly used approach is checkpoint
and restart (CPR): the state of the computation is saved periodically on disk, and when a
failure occurs, the computation is restarted from the last saved state. At present, it is the
responsibility of the programmer to instrument applications for CPR. Our group is
investigating the use of compiler technology to instrument codes to ... 

Keywords: checkpointing, fault-tolerance, OpenMP, shared-memory programs



4 Exploiting dead value information 

Milo M. Martin, Amir Roth, Charles N. Fischer 

December 1997 Proceedings of the 30th annual ACM/IEEE international symposium on 
We describe Dead Value Information (DVI) and introduce three new optimizations which
exploit it. DVI provides assertions that certain register values are dead, meaning they will 
not be read before being overwritten. The processor can use DVI to track dead registers 
and dynamically eliminate unnecessary save and restore instructions from the execution 
stream at procedure calls and context switches. Our results indicate that dynamic save and
restore instances can be reduced by 46% for procedure c ...

5 Hardware-managed register allocation for embedded processors 
Xiaotong Zhuang, Tao Zhang, Santosh Pande 

June 2004 ACM SIGPLAN Notices, Proceedings of the 2004 ACM SIGPLAN/SIGBED
conference on Languages, compilers, and tools, volume 39 issue 7

Most modern processors (either embedded or general purpose) contain a higher number of
physical registers than those exposed in the ISA. Due to a variety of reasons, this 
phenomenon is likely to continue especially on embedded systems where encoding space is 
very limited. Saving the encoding space leads to lower power consumption in the I-cache; 
on the other hand, harnessing more physical registers saves power in the memory 
subsystem and reduces latency as well. These design decisions however resu ... 

Keywords: architected registers, embedded systems, physical registers, power 
consumption, register allocation 
Superscalar architectures;
superscalar processors 

Rajeev Balasubramonian, Sandhya Dwarkadas, David H. Albonesi 

December 2001 Proceedings of the 34th annual ACM/IEEE international symposium on 
Dynamic superscalar processors execute multiple instructions out-of-order by looking for
independent operations within a large window. The number of physical registers within the 
processor has a direct impact on the size of this window as most in-flight instructions 
require a new physical register at dispatch. A large multi-ported register file helps improve 
the instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, 
especially in future wire-limited technologies. ... 
Muthian Sivathanu, Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
May 2005 ACM Transactions on Storage (TOS), Volume 1 issue 2

We present the design, implementation, and evaluation of D-GRAID, a gracefully degrading 
and quickly recovering RAID storage array. D-GRAID ensures that most files within the file 
system remain available even when an unexpectedly high number of faults occur. D-GRAID 
achieves high availability through aggressive replication of semantically critical data, and 
fault-isolated placement of logically related data. D-GRAID also recovers from failures 
quickly, restoring only live file system data to a h ... 

Keywords: Block-based storage, Disk array, RAID, fault isolation, file systems, smart disks 
Static Single Assignment (SSA) is an effective intermediate representation in optimizing
compilers. However, traditional SSA form and optimizations are not applicable to programs 
represented as native machine instructions because the use of dedicated registers imposed 
by calling conventions, the runtime system, and target architecture must be made explicit. 
We present a simple scheme for converting between programs in machine code and in SSA, 
such that references to dedicated physical registers ... 
An implementation scheme of fine-grain multithreading that needs no changes to current
calling standards for sequential languages and modest extensions to sequential compilers is 
described. Like previous similar systems, it performs an asynchronous call as if it were an 
ordinary procedure call, and detaches the callee from the caller when the callee suspends or 
either of them migrates to another processor. Unlike previous similar systems, it detaches 
and connects arbitrary frames generated by of ... 

10 Inferring annotated types for inter-procedural register allocation with constructor
flattening

Torben Amtoft, Robert Muller 

January 2003 ACM SIGPLAN Notices, Proceedings of the 2003 ACM SIGPLAN
international workshop on Types in languages design and 
implementation, volume 38 issue 3 

We introduce an annotated type system for a compiler intermediate language. The type 
system is designed to support inter-procedural register allocation and the representation of 
tuples and variants directly in the register file. We present an algorithm that generates 
constraints for assigning annotations, and prove its soundness with respect to the type 
system. 
Code optimization and scheduling for superscalar and superpipelined processors often
increase the register requirement of programs. For existing instruction sets with a small to
moderate number of registers, this increased register requirement can be a factor that
limits the effectiveness of the compiler. In this paper, we introduce a new architectural
method for adding a set of extended registers into an architecture. Using a novel concept of 
connection, this method allows the data stored in ... 
June 1990 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1990 conference
on Programming language design and implementation, Volume 25 issue 6
Though graph coloring algorithms have been shown to work well when applied to register
allocation problems, the technique has not been generalized for processor architectures in 
which some instructions refer to individual operands that are comprised of multiple 
registers. This paper presents a suitable generalization. 

Fortran 8X draft
Loren P. Meissner 

December 1989 ACM SIGPLAN Fortran Forum, Volume 8 issue 4 

Standard Programming Language Fortran. This standard specifies the form and 
establishes the interpretation of programs expressed in the Fortran language. It consists of 
the specification of the language Fortran. No subsets are specified in this standard. The 
previous standard, commonly known as "FORTRAN 77", is entirely contained within this 
standard, known as "Fortran 8x". Therefore, any standard-conforming FORTRAN 77 
program is standard conforming under this standard. New features can b ... 

15 Parallel execution of prolog programs: a survey
Gopal Gupta, Enrico Pontelli, Khayri A. M. Ali, Mats Carlsson, Manuel V. Hermenegildo
Since the early days of logic programming, researchers in the field realized the potential for
exploitation of parallelism present in the execution of logic programs. Their high-level 
nature, the presence of nondeterminism, and their referential transparency, among other 
characteristics, make logic programs interesting candidates for obtaining speedups through 
parallel execution. At the same time, the fact that the typical applications of logic 
programming frequently involve irregular computatio ... 

Keywords: Automatic parallelization, constraint programming, logic programming, 
parallelism, prolog 
May 2001 ACM SIGARCH Computer Architecture News, Proceedings of the 28th
The continuation of the remarkable exponential increases in processing power over the
recent past faces imminent challenges due in part to the physics of deep-submicron CMOS 
devices and the costs of both chip masks and future fabrication plants. A promising solution 
to these problems is offered by an alternative to CMOS-based computing, chemically 
assembled electronic nanotechnology (CAEN). 

In this paper we outline how CAEN-based computing can become a reality. We briefly 
describe rec ... 
The trend towards extensible software architectures and component-based software
development demands safe, efficient, and easy-to-use extension mechanisms to enforce 
protection boundaries among software modules residing in the same address space. This 
paper describes the design, implementation, and evaluation of a novel intra-address space 
protection mechanism called Palladium, which exploits the segmentation and paging 
hardware in the Intel x86 architecture and efficiently supports safe ...
This paper presents methods for empirical evaluation of features of Instruction Set
Processors (ISPs). ISP features are evaluated in terms of the time used or saved by having 
or not having the feature. The methods are based on analysis of traces of program 
executions. The concept of a register life is introduced, and used to answer questions like: 
How many registers are used simultaneously? How many would be sufficient all of the time? 
Most of the time? What would the overhead be if the num ... 

Keywords: computer architecture, execution time, instruction sets, instruction tracing, 
opcode utilization, program behavior, register structures, register utilization, simultaneous 
Physical register access time increases the delay between scheduling and execution in
modern out-of-order processors. As the number of physical registers increases, this delay
grows, forcing designers to employ register files with multicycle access. This paper
advocates more efficient utilization of a smaller number of physical registers in order to reduce
the access time of the physical register file. Register values with few significant bits are
stored in the rename map using physical register inlining, ...
The physical register file is an important component of a dynamically-scheduled processor.
Increasing the amount of parallelism places increasing demands on the physical register
file, calling for alternative file organization and management strategies. This paper considers
the use of value locality to optimize the operation of physical register files. We present
empirical data showing that: (i) the value produced by an instruction is often the same as a
value produced by another recently executed instr ...
There is a growing class of applications implemented in object-oriented languages that are
large and complex, that exploit object persistence, and need to run uninterrupted for long 
periods of time. Development and maintenance of such applications can present challenges 
in the following interrelated areas: consistent and scalable evolution of persistent data and 
code, optimal build management, and runtime changes to applications. The research 
presented in this thesis addresses the above issues. ... 
Superscalar processors currently have the potential to fetch multiple basic blocks per cycle
by employing one of several recently proposed instruction fetch mechanisms. However, this 
increased fetch bandwidth cannot be exploited unless pipeline stages further downstream 
correspondingly improve. In particular, register renaming a large number of instructions per 
cycle is difficult. A large instruction window, needed to receive multiple basic blocks per 
cycle, will slow down dependence resolution ... 
Understanding distributed applications is a tedious and difficult task. Visualizations based on
process-time diagrams are often used to obtain a better understanding of the execution of 
the application. The visualization tool we use is Poet, an event tracer developed at the 
University of Waterloo. However, these diagrams are often very complex and do not provide 
the user with the desired overview of the application. In our experience, such tools display 
repeated occurrences of non-trivial commun ... 
Wide, deep pipelines need many physical registers to hold the results of in-flight
instructions. Simultaneously, high clock frequencies prohibit using large register files and
bypass networks without a significant performance penalty. Previously proposed
techniques using register caching to reduce this penalty suffer from several problems
including poor insertion and replacement decisions and the need for a fully-associative cache
for good performance. We present novel mechanisms for managing and indexin ...

13 Supermachines and Superminds
Eric Steinhart

February 2003 Minds and Machines, volume 13 issue 1

If the computational theory of mind is right, then minds are realized by machines. There is 
an ordered complexity hierarchy of machines. Some finite machines realize finitely complex 
minds; some Turing machines realize potentially infinitely complex minds. There are many 
logically possible machines whose powers exceed the Church-Turing limit (e.g. accelerating 
Turing machines). Some of these supermachines realize superminds. Superminds perform 
cognitive supertasks. Their thoughts are fo ... 
These proceedings record the First International Workshop on Persistence and Java, which
was held in Drymen, Scotland in September 1996. The focus of this workshop was the 
relationship between the Java language and long-term data storage, such as databases
and orthogonal persistence. There are many approaches being taken, some pragmatic and 
some guided by design principles. If future application programmers building large and 
long-lived systems are to be well-supported, it is essential that the ... 
These proceedings record the Second International Workshop on Persistence and Java, which
was held in Half Moon Bay in the San Francisco Bay Area, in August 1997. The focus of the
workshop series is the relationship between the Java platform and long-term storage, such
as databases and orthogonal persistence. If future application programmers building large
and long-lived systems are to be well supported, it is essential that the lessons of existing
The detection and correct handling of data and control dependencies constitutes one of the
biggest issues in exposing ILP in current architectures. The ever increasing memory latencies
and working space of programmes are making prefetching techniques crucial for the 
attainment of sustained high performance. Software prefetching allows the compiler to use 
information discovered at compile-time to effectively bring needed data before it is used, 
thus hiding all or part of the latency from main me ... 
The multicluster architecture that we introduce offers a decentralized, dynamically-
scheduled architecture, in which the register files, dispatch queue, and functional units of
the architecture are distributed across multiple clusters, and each cluster is assigned a
subset of the architectural registers. The motivation for the multicluster architecture is to
reduce the clock cycle time, relative to a single-cluster architecture with the same number 
of hardware resources, by reducing the size and ... 

Keywords: decentralized architecture, partitioned architecture, static instruction 
scheduling, register allocation 
To date, the implementation of message passing languages has required the
communications variables (sometimes called ports) either to be limited to the number of
physical communications registers in the machine or to be mapped to memory. Neither 
solution is satisfactory. Limiting the number of variables decreases modularity and 
efficiency of parallel programs. Mapping variables to memory increases the cost of 
communications and the granularity of parallelism. We present here a new programmi ... 